DeepChem supports a range of input file formats, including .csv, .sdf, .fasta, .png, and .tif. Loading for each format is handled by the Loader class associated with that format. For example, .csv input is read by the CSVLoader class under the hood.
Here's an example of a .csv input file that fits the requirements of CSVLoader.
Compound ID | measured log solubility in mols per litre | smiles |
---|---|---|
benzothiazole | -1.5 | c2ccc1scnc1c2 |
Here the "smiles" column contains the SMILES string, the "measured log solubility in mols per litre" contains the experimental measurement and "Compound ID" contains the unique compound identifier.
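As a minimal sketch of how the sample file above might be loaded, the snippet below writes the one-row table to disk with the standard library and defines a helper that hands it to dc.data.CSVLoader. The helper names write_sample_csv and load_sample_csv, the temporary file path, and the choice of dc.feat.CircularFingerprint with size=1024 are assumptions made for illustration. The feature_field and id_field arguments and the create_dataset method follow recent DeepChem releases; older releases used smiles_field instead, so check the version you have installed.

```python
import csv
import os
import tempfile

# Column names copied from the sample table above.
ID_COLUMN = "Compound ID"
TASK_COLUMN = "measured log solubility in mols per litre"
SMILES_COLUMN = "smiles"


def write_sample_csv(path):
    """Write the one-row sample table to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([ID_COLUMN, TASK_COLUMN, SMILES_COLUMN])
        writer.writerow(["benzothiazole", "-1.5", "c2ccc1scnc1c2"])


def load_sample_csv(path):
    """Featurize the CSV with DeepChem (assumes deepchem and RDKit are installed)."""
    import deepchem as dc  # imported lazily because it is a heavy dependency

    loader = dc.data.CSVLoader(
        tasks=[TASK_COLUMN],            # column holding the experimental readout
        feature_field=SMILES_COLUMN,    # column holding the SMILES string
        id_field=ID_COLUMN,             # column holding the compound identifier
        featurizer=dc.feat.CircularFingerprint(size=1024),  # illustrative choice
    )
    return loader.create_dataset(path)


# Write the sample file to a temporary directory (hypothetical location).
sample_path = os.path.join(tempfile.mkdtemp(), "solubility.csv")
write_sample_csv(sample_path)
```

Calling load_sample_csv(sample_path) should then return a dataset with one row per molecule, with the featurized molecules in its X attribute and the solubility values in its y attribute.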
[2] Anderson, Eric, Gilman D. Veith, and David Weininger. "SMILES, a line notation and computerized interpreter for chemical structures." US Environmental Protection Agency, Environmental Research Laboratory, 1987.
Most machine learning algorithms require input data in the form of vectors. However, input data for drug-discovery datasets routinely comes as lists of molecules and associated experimental readouts. To transform lists of molecules into vectors, we use subclasses of DeepChem's loader class dc.data.DataLoader, such as dc.data.CSVLoader or dc.data.SDFLoader. Users can subclass dc.data.DataLoader to load arbitrary file formats. All loaders must be passed a dc.feat.Featurizer object, which determines how each molecule is converted into a vector. DeepChem provides a number of subclasses of dc.feat.Featurizer for convenience.
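The division of labor between a loader and a featurizer can be illustrated with a toy sketch in plain Python. Nothing below is DeepChem code: ToyCharCountFeaturizer and load_rows are hypothetical stand-ins that only mimic the idea of a loader reading records and delegating the molecule-to-vector step to a featurizer object. The character-count "features" are deliberately crude and not chemically meaningful in general.

```python
from typing import Dict, List, Sequence, Tuple


class ToyCharCountFeaturizer:
    """Hypothetical stand-in for a featurizer (NOT a DeepChem class).

    Maps each SMILES string to a fixed-length vector by counting a few
    characters. This is a crude illustration, not real cheminformatics.
    """

    symbols = ("c", "n", "o", "s")

    def featurize(self, smiles_list: Sequence[str]) -> List[List[int]]:
        return [[smi.count(sym) for sym in self.symbols] for smi in smiles_list]


def load_rows(
    rows: Sequence[Dict[str, str]],
    featurizer: ToyCharCountFeaturizer,
    smiles_key: str,
    task_key: str,
) -> Tuple[List[List[int]], List[float]]:
    """Hypothetical loader: pull out the fields, delegate featurization."""
    smiles = [row[smiles_key] for row in rows]
    X = featurizer.featurize(smiles)            # molecules -> vectors
    y = [float(row[task_key]) for row in rows]  # experimental readouts
    return X, y


rows = [
    {
        "Compound ID": "benzothiazole",
        "measured log solubility in mols per litre": "-1.5",
        "smiles": "c2ccc1scnc1c2",
    }
]
X, y = load_rows(
    rows,
    ToyCharCountFeaturizer(),
    smiles_key="smiles",
    task_key="measured log solubility in mols per litre",
)
```

Because the featurizer is passed in as an object, changing the molecular representation means swapping that one argument while the loading logic stays the same, which is the same pattern the DeepChem loaders follow with dc.feat.Featurizer subclasses.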